Push local pool memory to a global list at thread termination #1621
Conversation
Force-pushed from 0d95e6a to 7b13522.
 {
-  PONY_ABA_PROTECTED(T) ret = {NULL, 0};
+  PONY_ABA_PROTECTED_PTR(T) ret = {NULL, 0};
The interlocked exchange seems wrong. Such a heavyweight operation for a load? I can't see anywhere where that would be necessary?
Similarly, big_store uses a CAS where a store should be sufficient.
This is because x86 doesn't have plain load/store instructions for 16-byte operands. The only instruction that can work on an operand of that size is cmpxchg16b, and if we used two plain mov instructions instead, there would be a time window where a thread could modify one half of the value while another thread is still reading the other half. For reference, both GCC and Clang also generate cmpxchg16b for atomic loads/stores of 16-byte objects.
I've tried to reduce the overhead introduced by these operations by using non-atomic loads/stores whenever possible (for example, pool.c:465). Also, the fast bailout paths of both pool_pull and pool_block_pull don't use any big atomics or hardware synchronisation.
So what I'm thinking is that we are using bigatomic_load to "initialise" a 16-byte ABA-protected pointer before a CAS operation. If we load the two 8-byte chunks independently (cheaply!), we will either get a matched pair (which may succeed during the CAS, and everything is fine) or an unmatched pair.
If we get an unmatched pair, the loads could be in either order. If we read a stale ABA counter and a current pointer, we will do a second loop on the CAS, which is no more expensive than the initial atomic read. If we read a current ABA counter and a stale pointer: same result!
The only way to provoke an error would be if 2^64 writes happened between the ABA read and the pointer read, and the 2^64th write wrote back the old pointer.
Is there another error condition that I'm missing?
That would work, yes. I think it will only be possible in pool_push/pull and not in pool_block_pull/push, because in the latter the CAS can be on any element of the list and, if we fail, we always retry from the start of the list. I'll update the PR.
It turns out it is also possible in pool_block_push/pull, but with an additional branch to avoid the CAS when we don't need it. Would that be an interesting tradeoff?
After thinking about it, these functions won't be called very often, so it probably won't make a difference. I'll include it in the change.
@@ -239,6 +239,7 @@ DECLARE_THREAD_FN(ponyint_asio_backend_dispatch)
   close(b->wakeup);
   ponyint_messageq_destroy(&b->q);
   POOL_FREE(asio_backend_t, b);
+  pony_unregister_thread();
I like the explicit unregister approach.
Force-pushed from d6ef876 to ba28fee.
This is a change originally suggested by @sylvanc in ponylang#1621. On x86, this results in fewer occurrences of the expensive cmpxchg16b instruction.
src/libponyrt/sched/mpmcq.c (outdated)
@@ -53,7 +53,7 @@ void ponyint_mpmcq_destroy(mpmcq_t* q)
 {
   atomic_store_explicit(&q->head, NULL, memory_order_relaxed);
 #ifdef PLATFORM_IS_X86
-  PONY_ABA_PROTECTED(mpmcq_node_t*) tail = bigatomic_load_explicit(&q->tail,
+  PONY_ABA_PROTECTED_PTR(mpmcq_node_t) tail = bigatomic_load_explicit(&q->tail,
bigatomic_load_explicit isn't needed here either. Only the .object field is used.
src/libponyrt/sched/mpmcq.c (outdated)
@@ -40,7 +40,7 @@ void ponyint_mpmcq_init(mpmcq_t* q)

   atomic_store_explicit(&q->head, node, memory_order_relaxed);
 #ifdef PLATFORM_IS_X86
-  PONY_ABA_PROTECTED(mpmcq_node_t*) tail;
+  PONY_ABA_PROTECTED_PTR(mpmcq_node_t) tail;
No need for bigatomic_store_explicit here. There is no concurrency issue when initialising the queue.
Sorry @Praetonus, I just noticed two more.
This change adds a global list of free blocks to the pool allocator, similar to the existing global list of free lists. When a runtime thread terminates, it will push its local free blocks to the global list, which can then be retrieved by other threads. The goal of this change is to avoid memory leaks when starting and stopping the runtime multiple times in a program, for example in the compiler test suite. This pool allocator cleanup functionality is exposed to user threads through the new pony_unregister_thread function.
Force-pushed from 6b35cb0 to 9862796.
I've updated the PR.
Excellent, thanks @Praetonus!
This change adds a global list of free blocks to the pool allocator, similar to the existing global list of free lists. When a runtime thread terminates, it will push its local free blocks to the global list, which can then be retrieved by other threads.
The goal of this change is to avoid memory leaks when starting and stopping the runtime multiple times in a program, for example in the compiler test suite.
This pool allocator cleanup functionality is exposed to user threads through the new pony_unregister_thread function.
PS: I expect that this change will introduce a performance penalty at runtime termination, especially for long-running programs with lots of memory in their free lists. I haven't done any real measurements, so I don't have a precise estimate of the slowdown. If that's an issue, I can modify the patch to add a runtime option for pool cleanup.